The degrees of freedom of the Lasso for general design matrix



C. Dossal (1), M. Kachour (2), M.J. Fadili (2), G. Peyré (3) and C. Chesneau (4)

(1) IMB, CNRS-Univ. Bordeaux 1
351 Cours de la Libération, F-33405 Talence, France
Charles.Dossal@math.u-bordeaux1.fr

(2) GREYC, CNRS-ENSICAEN-Univ. Caen
6 Bd du Maréchal Juin, 14050 Caen, France
Jalal.Fadili@greyc.ensicaen.fr
Maher.Kachour@greyc.ensicaen.fr

(3) Ceremade, CNRS-Univ. Paris-Dauphine
Place du Maréchal De Lattre De Tassigny, 75775 Paris 16, France
Gabriel.Peyre@ceremade.dauphine.fr

(4) LMNO, CNRS-Univ. Caen
Département de Mathématiques, UFR de Sciences, 14032 Caen, France
Chesneau.Christophe@math.unicaen.fr



Abstract 

In this paper, we investigate the degrees of freedom (dof) of penalized $\ell_1$ minimization (also
known as the Lasso) for linear regression models. We give a closed-form expression of the dof
of the Lasso response. Namely, we show that for any given Lasso regularization parameter $\lambda$
and any observed data $y$ belonging to a set of full (Lebesgue) measure, the cardinality of the
support of a particular solution of the Lasso problem is an unbiased estimator of the degrees
of freedom. This is achieved without requiring uniqueness of the Lasso solution. Thus,
our result holds true for both the underdetermined and the overdetermined case, where the
latter was originally studied in [32]. We also show, by providing a simple counterexample, that
although the dof theorem of [32] is correct, their proof contains a flaw since their divergence
formula holds on a different set of full measure than the one that they claim. An effective
estimator of the number of degrees of freedom may have several applications including an
objectively guided choice of the regularization parameter in the Lasso through the SURE
framework. Our theoretical findings are illustrated through several numerical simulations.

1 Introduction



1.1 Problem statement 

We consider the following linear regression model

$$y = A x^0 + \varepsilon, \qquad \mu = A x^0, \tag{1}$$

where $y \in \mathbb{R}^n$ is the observed data or the response vector, $A = (a_1, \cdots, a_p)$ is an $n \times p$ design
matrix, $x^0 = (x^0_1, \cdots, x^0_p)^T$ is the vector of unknown regression coefficients and $\varepsilon$ is a vector of i.i.d.
centered Gaussian random variables with variance $\sigma^2 > 0$. In this paper, the number of observations
$n$ can be greater than the ambient dimension $p$ of the regression vector to be estimated. Recall
that when $n < p$, (1) is an underdetermined linear regression model, whereas when $n \geq p$ and all
the columns of $A$ are linearly independent, it is overdetermined.

Let $\hat x(y)$ be an estimator of $x^0$, and $\hat\mu(y) = A\hat x(y)$ be the associated response or predictor. The
concept of degrees of freedom plays a pivotal role in quantifying the complexity of a statistical
modeling procedure. More precisely, since $y \sim \mathcal{N}(\mu = Ax^0, \sigma^2 \mathrm{Id}_{n \times n})$ ($\mathrm{Id}_{n \times n}$ is the identity on
$\mathbb{R}^n$), according to [8], the degrees of freedom (dof) of the response $\hat\mu(y)$ is defined by

$$df = \sum_{i=1}^{n} \frac{\mathrm{cov}(\hat\mu_i(y), y_i)}{\sigma^2}. \tag{2}$$

Many model selection criteria involve $df$, e.g. $C_p$ (Mallows [14]), AIC (Akaike Information Criterion,
[1]), BIC (Bayesian Information Criterion, [22]), GCV (Generalized Cross Validation, [3]) and SURE
(Stein's unbiased risk estimation [23], see Section 2.2). Thus, the dof is a quantity of interest in
model validation and selection and it can be used to get the optimal hyperparameters of the
estimator. Note that the optimality here is intended in the sense of the prediction $\hat\mu(y)$ and not
the coefficients $\hat x(y)$.

The well-known Stein's lemma [23] states that if $y \mapsto \hat\mu(y)$ is weakly differentiable then its
divergence is an unbiased estimator of its degrees of freedom, i.e.

$$\widehat{df}(y) = \mathrm{div}(\hat\mu(y)) = \sum_{i=1}^{n} \frac{\partial \hat\mu_i(y)}{\partial y_i}, \quad \text{and} \quad \mathbb{E}(\widehat{df}(y)) = df. \tag{3}$$



Here, in order to estimate $x^0$, we consider solutions to the Lasso problem, proposed originally
in [26]. The Lasso amounts to solving the following convex optimization problem

$$\min_{x \in \mathbb{R}^p} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1, \tag{$P_1(y,\lambda)$}$$

where $\lambda > 0$ is called the Lasso regularization parameter and $\|\cdot\|_2$ (resp. $\|\cdot\|_1$) denotes the $\ell_2$
(resp. $\ell_1$) norm. An important feature of the Lasso is that it promotes sparse solutions. In the
last years, there has been a huge amount of work where efforts have focused on investigating the 
theoretical guarantees of the Lasso as a sparse recovery procedure from noisy measurements. See, 
e.g., [9l[a[30l[ai[l9j[l6l[IIl|7l[ni|27], to name just a few. 

1.2 Contributions and related work 

Let $\hat\mu_\lambda(y) = A\hat x_\lambda(y)$ be the Lasso response vector, where $\hat x_\lambda(y)$ is a solution of the Lasso problem
($P_1(y,\lambda)$). Note that all minimizers of the Lasso share the same image under $A$, i.e. $\hat\mu_\lambda(y)$ is
uniquely defined; see Lemma 2 in Section 5 for details. The main contribution of this paper is first
to provide an unbiased estimator of the degrees of freedom of the Lasso response for any design
matrix. The estimator is valid everywhere except on a set of (Lebesgue) measure zero. We reach
our goal without any additional assumption to ensure uniqueness of the Lasso solution. Thus, our
result covers the challenging underdetermined case where the Lasso problem does not necessarily
have a unique solution. It obviously holds when the Lasso problem ($P_1(y,\lambda)$) has a unique solution,
and in particular in the overdetermined case studied in [32]. Using the estimator at hand, we also
establish the reliability of the SURE as an unbiased estimator of the Lasso prediction risk.

While this paper was submitted, we became aware of the independent work of Tibshirani and 
Taylor [25], who studied the dof for general A both for the Lasso and the general (analysis) Lasso. 

Section 3 is dedicated to a thorough comparison and discussion of connections and differences
between our results and the one in [32, Theorem 1] for the overdetermined case, and that of
HH HI] for the general case. 



1.3 Overview of the paper 

This paper is organized as follows. Section 2 is the core contribution of this work where we state our
main results. There, we provide the unbiased estimator of the dof of the Lasso, and we investigate
the reliability of the SURE estimate of the Lasso prediction risk. Then, we discuss the relation of
our work to concurrent work in the literature in Section 3. Numerical illustrations are given in
Section 4. The proofs of our results are postponed to Section 5. A final discussion and perspectives
of this work are provided in Section 6.



2 Main results 

2.1 An unbiased estimator of the dof 

First, some notations and definitions are necessary. For any vector $x$, $x_i$ denotes its $i$th component.
The support or the active set of $x$ is defined by

$$I = \mathrm{supp}(x) = \{i : x_i \neq 0\},$$

and we denote its cardinality as $|\mathrm{supp}(x)| = |I|$. We denote by $x_I \in \mathbb{R}^{|I|}$ the vector built by
restricting $x$ to the entries indexed by $I$. The active matrix $A_I = (a_i)_{i \in I}$ associated to a vector $x$
is obtained by selecting the columns of $A$ indexed by the support $I$ of $x$. Let $\cdot^T$ be the transpose
symbol. Suppose that $A_I$ is full column rank, then we denote the Moore-Penrose pseudo-inverse
of $A_I$ by $A_I^+ = (A_I^T A_I)^{-1} A_I^T$. $\mathrm{sign}(\cdot)$ represents the sign function: $\mathrm{sign}(a) = 1$ if $a > 0$; $\mathrm{sign}(a) = 0$
if $a = 0$; $\mathrm{sign}(a) = -1$ if $a < 0$.

For any $I \subseteq \{1, 2, \cdots, p\}$, let $V_I = \mathrm{span}(A_I)$, $P_{V_I}$ the orthogonal projector onto $V_I$ and $P_{V_I^\perp}$ that
onto the orthogonal complement $V_I^\perp$.

Let $S \in \{-1, 1\}^{|I|}$ be a sign vector, and $j \in \{1, 2, \cdots, p\}$. Fix $\lambda > 0$. We define the following
set of hyperplanes

$$H_{I,j,S} = \{u \in \mathbb{R}^n : \langle P_{V_I^\perp}(a_j), u \rangle = \pm\lambda(1 - \langle a_j, (A_I^+)^T S \rangle)\}. \tag{4}$$

Note that, if $a_j$ does not belong to $V_I$, then $H_{I,j,S}$ becomes a finite union of two hyperplanes. Now,
we define the following finite set of indices

$$\Omega = \{(I, j, S) : a_j \notin V_I\} \tag{5}$$

and let $G_\lambda$ be the subset of $\mathbb{R}^n$ which excludes the finite union of hyperplanes associated to $\Omega$, that
is

$$G_\lambda = \mathbb{R}^n \setminus \bigcup_{(I,j,S) \in \Omega} H_{I,j,S}. \tag{6}$$

To cut a long story short, $\bigcup_{(I,j,S) \in \Omega} H_{I,j,S}$ is a set of (Lebesgue) measure zero (Hausdorff dimension
$n - 1$), and therefore $G_\lambda$ is a set of full measure.

We are now ready to introduce our main theorem. 



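Theorem 1. Fix $\lambda > 0$. For any $y \in G_\lambda$, consider $\mathcal{M}_{y,\lambda}$ the set of solutions of ($P_1(y,\lambda)$). Let
$\hat x^*_\lambda(y) \in \mathcal{M}_{y,\lambda}$ with support $I^*$ such that $A_{I^*}$ is full rank. Then,

$$|I^*| = \min_{\hat x_\lambda(y) \in \mathcal{M}_{y,\lambda}} |\mathrm{supp}(\hat x_\lambda(y))|. \tag{7}$$

Furthermore, there exists $\varepsilon > 0$ such that for all $z \in \mathrm{Ball}(y, \varepsilon)$, the $n$-dimensional ball with center
$y$ and radius $\varepsilon$, the Lasso response mapping $z \mapsto \hat\mu_\lambda(z)$ satisfies

$$\hat\mu_\lambda(z) = \hat\mu_\lambda(y) + P_{V_{I^*}}(z - y). \tag{8}$$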

As stated, this theorem assumes the existence of a solution whose active matrix $A_{I^*}$ is full rank.
This can be shown to be true; see e.g. [5, Proof of Theorem 1] or [IB Theorem 3, Section B.1].$^1$ It
is worth noting that this proof is constructive, in that it yields a solution $\hat x^*_\lambda(y)$ of ($P_1(y,\lambda)$) such
that $A_{I^*}$ is full column rank from any solution $\hat x_\lambda(y)$ whose active matrix has a nontrivial kernel.
This will be exploited in Section 4 to derive an algorithm to get $\hat x^*_\lambda(y)$, and hence $I^*$.

A direct consequence of our main theorem is that on $G_\lambda$, the mapping $y \mapsto \hat\mu_\lambda(y)$ is $C^\infty$ and
the sign and support are locally constant. Applying Stein's lemma yields Corollary 1 below. The
latter states that the number of nonzero coefficients of $\hat x^*_\lambda(y)$ is an unbiased estimator of the dof
of the Lasso.

Corollary 1. Under the assumptions and with the same notations as in Theorem 1, we have the
following divergence formula

$$\widehat{df}_\lambda(y) := \mathrm{div}(\hat\mu_\lambda(y)) = |I^*|. \tag{9}$$

Therefore,

$$df = \mathbb{E}(\widehat{df}_\lambda(y)) = \mathbb{E}(|I^*|). \tag{10}$$

Obviously, in the particular case where the Lasso problem has a unique solution, our result
holds true.
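
As a sanity check, it may help to specialize the divergence formula to the simplest possible design. The worked example below assumes $A = \mathrm{Id}_{n \times n}$; this is an assumption made purely for illustration, not a case singled out in the paper. Under it, the Lasso solution is the classical soft-thresholding of $y$, and the divergence can be computed by hand.

```latex
% Illustration only: orthonormal design A = Id (assumption), so \hat\mu_\lambda = \hat x_\lambda.
\hat\mu_\lambda(y)_i = \hat x_\lambda(y)_i
  = \mathrm{sign}(y_i)\,\max(|y_i| - \lambda,\, 0), \qquad i = 1, \dots, n,
\\[4pt]
\frac{\partial \hat\mu_\lambda(y)_i}{\partial y_i} =
  \begin{cases} 1 & \text{if } |y_i| > \lambda, \\ 0 & \text{if } |y_i| < \lambda, \end{cases}
\\[4pt]
\mathrm{div}(\hat\mu_\lambda(y)) = \#\{i : |y_i| > \lambda\}
  = |\mathrm{supp}(\hat x_\lambda(y))|.
```

In this special case the points where the formula fails are those with $|y_i| = \lambda$ for some $i$, which is a finite union of hyperplanes and hence a set of measure zero, in line with the general statement.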

2.2 Reliability of the SURE estimate of the Lasso prediction risk 

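In this work, we focus on the SURE as a model selection criterion. The SURE applied to the Lasso
reads

$$\mathrm{SURE}(\hat\mu_\lambda(y)) = -n\sigma^2 + \|\hat\mu_\lambda(y) - y\|_2^2 + 2\sigma^2\, \widehat{df}_\lambda(y), \tag{11}$$

where $\widehat{df}_\lambda(y)$ is an unbiased estimator of the dof as given in Corollary 1. It follows that
$\mathrm{SURE}(\hat\mu_\lambda(y))$ is an unbiased estimate of the prediction risk, i.e.

$$\mathrm{Risk}(\mu) = \mathbb{E}\left(\|\hat\mu_\lambda(y) - \mu\|_2^2\right) = \mathbb{E}\left(\mathrm{SURE}(\hat\mu_\lambda(y))\right).$$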
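
To make formula (11) concrete, here is a minimal sketch of how the SURE could be evaluated numerically. The function name `sure` and its argument names are our own choices for illustration; the dof estimate is passed in explicitly, for instance as the support size $|I^*|$ from Corollary 1.

```python
import numpy as np

def sure(mu_hat, y, sigma, dof_hat):
    """Evaluate (11): -n*sigma^2 + ||mu_hat - y||_2^2 + 2*sigma^2*dof_hat.

    mu_hat  : Lasso response A @ x_hat (1-D array of length n)
    y       : observations (1-D array of length n)
    sigma   : noise standard deviation (assumed known)
    dof_hat : dof estimate, e.g. the support size |I*|
    """
    n = y.size
    residual_sq = np.sum((mu_hat - y) ** 2)
    return -n * sigma**2 + residual_sq + 2.0 * sigma**2 * dof_hat

# Toy usage with an identity design (assumption), lambda = 1, sigma = 1:
y = np.array([3.0, 0.5, -2.0])
mu_hat = np.sign(y) * np.maximum(np.abs(y) - 1.0, 0.0)  # soft-thresholding
value = sure(mu_hat, y, sigma=1.0, dof_hat=np.count_nonzero(mu_hat))
```

Note that the noise level $\sigma$ is taken as known here, as in the paper; estimating it is a separate question.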



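1 This proof is alluded to in the note at the top of [21, Page 363].

We now evaluate its reliability by computing the expected squared-error between $\mathrm{SURE}(\hat\mu_\lambda(y))$
and $\mathrm{SE}(\hat\mu_\lambda(y))$, the true squared-error, that is

$$\mathrm{SE}(\hat\mu_\lambda(y)) = \|\hat\mu_\lambda(y) - \mu\|_2^2. \tag{12}$$

Theorem 2. Under the assumptions of Theorem 1, we have

$$\mathbb{E}\left((\mathrm{SURE}(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)))^2\right) = -2\sigma^4 n + 4\sigma^2\, \mathbb{E}\left(\|\hat\mu_\lambda(y) - y\|_2^2\right) + 4\sigma^4\, \mathbb{E}(|I^*|). \tag{13}$$

Moreover,

$$\mathbb{E}\left(\left(\frac{\mathrm{SURE}(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y))}{n\sigma^2}\right)^2\right) = O\left(\frac{1}{n}\right). \tag{14}$$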

3 Relation to prior work

Overdetermined case [32]

The authors in [32] studied the dof of the Lasso in the overdetermined case, that is, when $n \geq p$
and all the columns of the design matrix $A$ are linearly independent, i.e. $\mathrm{rank}(A) = p$. In fact, in
this case the Lasso problem has a unique minimizer $\hat x_\lambda(y) = \hat x^*_\lambda(y)$ (see Theorem 1).

Before discussing the result of [32], let us point out a well-known feature of $\hat x_\lambda(y)$ as $\lambda$ varies in
$]0, +\infty[$:

• For $\lambda \geq \|A^T y\|_\infty$, the optimum is attained at $\hat x_\lambda(y) = 0$.

• The interval $]0, \|A^T y\|_\infty[$ is divided into a finite number of subintervals characterized by the
fact that within each such subinterval, the support and the sign vector of $\hat x_\lambda(y)$ are constant.
Explicitly, let $(\lambda_m)_{0 \leq m \leq K}$ be the finite sequence of values of $\lambda$ corresponding to a variation
of the support and the sign of $\hat x_\lambda(y)$, defined by

$$\|A^T y\|_\infty = \lambda_0 > \lambda_1 > \lambda_2 > \cdots > \lambda_K = 0.$$

Thus, in $]\lambda_{m+1}, \lambda_m[$, the support and the sign of $\hat x_\lambda(y)$ are constant, see [7], [17], [18]. Hence,
we call $(\lambda_m)_{0 \leq m \leq K}$ the transition points.

Now, let $\lambda \in\, ]\lambda_{m+1}, \lambda_m[$. Thus, from Lemma 1 (see Section 5), we have the following implicit form
of $\hat x_\lambda(y)$,

$$(\hat x_\lambda(y))_{I_m} = A_{I_m}^+ y - \lambda (A_{I_m}^T A_{I_m})^{-1} S_m, \tag{15}$$

where $I_m$ and $S_m$ are respectively the (constant) support and sign vector of $\hat x_\lambda(y)$ for $\lambda \in\, ]\lambda_{m+1}, \lambda_m[$.



Hence, based on (15), [32] showed that for all $\lambda > 0$, there exists a set of measure zero $\mathcal{N}_\lambda$, which
is a finite collection of hyperplanes in $\mathbb{R}^n$, and they defined

$$\mathcal{K}_\lambda = \mathbb{R}^n \setminus \mathcal{N}_\lambda, \tag{16}$$

so that for all $y \in \mathcal{K}_\lambda$, $\lambda$ is not any of the transition points.

Then, for the overdetermined case, [32] stated that for all $y \in \mathcal{K}_\lambda$, the number of nonzero coefficients
of the unique solution of ($P_1(y,\lambda)$) is an unbiased estimator of the dof. In fact, their main argument
is that, by eliminating the vectors associated to the transition points, the support and the sign of
the Lasso solution are locally constant with respect to $y$, see [32, Lemma 5].

We recall that the overdetermined case, considered in [32], is a particular case of our result
since the minimizer is unique. Thus, according to Corollary 1, we find the same result as [32]
but valid on a different set, namely $y \in G_\lambda = \mathbb{R}^n \setminus \bigcup_{(I,j,S) \in \Omega} H_{I,j,S}$. A natural question arises: can we
compare our assumption to that of [32]? In other words, is there a link between $\mathcal{K}_\lambda$ and $G_\lambda$?

The answer is that, depending on the matrix $A$, these two sets may be different. More importantly,
it turns out that although the dof formula [32, Theorem 1] is correct, unfortunately, their
proof contains a flaw since their divergence formula [32, Lemma 5] is not true on the set $\mathcal{K}_\lambda$. We
prove this by providing a simple counterexample.

Example of vectors in $\mathcal{K}_\lambda$ but not in $G_\lambda$. Let $\{e_1, e_2\}$ be an orthonormal basis of $\mathbb{R}^2$, and
define $a_1 = e_1$ and $a_2 = e_1 + e_2$, and $A$ the matrix whose columns are $a_1$ and $a_2$.

Let $I = \{1\}$, $j = 2$ and $S = 1$. It turns out that $(A_I^+)^T = a_1$ and $\langle (A_I^+)^T S, a_j \rangle = 1$, which
implies that for all $\lambda > 0$,

$$H_{I,j,S} = \{u \in \mathbb{R}^n : \langle P_{V_I^\perp}(a_j), u \rangle = 0\} = \mathrm{span}(a_1).$$

Let $y = \alpha a_1$ with $\alpha > 0$. For any $\lambda > 0$, $y \in H_{I,j,S}$ (or equivalently here $y \notin G_\lambda$). Using Lemma 1
(see Section 5), one gets that for any $\lambda \in\, ]0, \alpha[$, the solution of ($P_1(y,\lambda)$) is $\hat x_\lambda(y) = (\alpha - \lambda, 0)$ and
that for any $\lambda \geq \alpha$, $\hat x_\lambda(y) = (0, 0)$. Hence the only transition point is $\lambda_0 = \alpha$. It follows that for
$\lambda < \alpha$, $y$ belongs to $\mathcal{K}_\lambda$ defined in [32], but $y \notin G_\lambda$.

We then prove that in any ball centered at $y$, there exists a vector $z_1$ such that the support of
the solution of ($P_1(z_1,\lambda)$) is different from the support of the solution of ($P_1(y,\lambda)$).

Choose $\lambda < \alpha$ and $\varepsilon \in\, ]0, \alpha - \lambda[$, and define $z_1 = y + \varepsilon e_2$. From Lemma 1 (see Section 5),
one deduces that the solution of ($P_1(z_1,\lambda)$) is equal to $\hat x_\lambda(z_1) = (\alpha - \lambda - \varepsilon, \varepsilon)$, whose support is
different from that of $\hat x_\lambda(y) = (\alpha - \lambda, 0)$.
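
The two claimed solutions in this counterexample can be checked numerically against the optimality conditions of Lemma 1. The sketch below is only an illustration: the helper `is_lasso_solution` and the particular values $\alpha = 2$, $\lambda = 1$, $\varepsilon = 0.5$ are our own choices, picked so that $\lambda < \alpha$ and $\varepsilon \in\, ]0, \alpha - \lambda[$.

```python
import numpy as np

def is_lasso_solution(A, y, x, lam, tol=1e-10):
    """Check the two optimality conditions of Lemma 1 for a candidate x."""
    corr = A.T @ (y - A @ x)          # correlations <a_j, y - A x>
    active = x != 0
    cond1 = np.allclose(corr[active], lam * np.sign(x[active]), atol=tol)
    cond2 = np.all(np.abs(corr[~active]) <= lam + tol)
    return bool(cond1 and cond2)

alpha, lam, eps = 2.0, 1.0, 0.5       # illustrative values (assumptions)
e2 = np.array([0.0, 1.0])
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])            # columns a1 = e1 and a2 = e1 + e2
y = alpha * A[:, 0]                   # y = alpha * a1
z1 = y + eps * e2                     # perturbed observation

x_y = np.array([alpha - lam, 0.0])          # claimed solution at y
x_z1 = np.array([alpha - lam - eps, eps])   # claimed solution at z1

ok_y = is_lasso_solution(A, y, x_y, lam)
ok_z1 = is_lasso_solution(A, z1, x_z1, lam)
support_changed = np.count_nonzero(x_y) != np.count_nonzero(x_z1)
```

Both candidates satisfy the conditions, and the support jumps from one to two entries under the perturbation $\varepsilon e_2$, however small $\varepsilon > 0$ is taken.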

More generally, when there are triples $(I, j, S)$ such that $\langle (A_I^+)^T S, a_j \rangle = 1$, a difference between
the two sets $G_\lambda$ and $\mathcal{K}_\lambda$ may arise. Clearly, the complement of $G_\lambda$ is not only the set of vectors
associated to the transition points.

According to the previous example, in this specific situation, for any $\lambda > 0$ there may exist
some vectors $y$ that are not associated to transition points but where the support of the solution
of ($P_1(y,\lambda)$) is not stable to infinitesimal perturbations of $y$. This situation may occur for under-
or overdetermined problems. In summary, even in the overdetermined case, excluding the set of
transition points is not sufficient to guarantee stability of the support and sign of the Lasso solution.

General case [12], [25], [28]

In [12], the author studies the degrees of freedom of a generalization of the Lasso where the
regression coefficients are constrained to a closed convex set. When the latter is an $\ell_1$ ball and
$p > n$, he proposes the cardinality of the support as an estimate of $df$ but under a restrictive
assumption on $A$ under which the Lasso problem has a unique solution.
In [25, Theorem 2], the authors proved that

$$df = \mathbb{E}(\mathrm{rank}(A_I))$$



where $I = I(y)$ is the active set of any solution $\hat x_\lambda(y)$ to ($P_1(y,\lambda)$). This coincides with Corollary 1
when $A_I$ is full rank with $\mathrm{rank}(A_I) = \mathrm{rank}(A_{I^*})$. Note that in general, there exist vectors $y \in \mathbb{R}^n$
where the smallest cardinality among all supports of Lasso solutions is different from the rank of
the active matrix associated to the largest support. But these vectors are precisely those excluded
in $G_\lambda$. In the case of the generalized Lasso (a.k.a. analysis sparsity prior in the signal processing
community), Vaiter et al. [28, Corollary 1] and Tibshirani and Taylor [25, Theorem 3] provide
a formula of an unbiased estimator of $df$. This formula reduces to that of Corollary 1 when the
analysis operator is the identity.

Figure 1: A counterexample for $n = p = 2$ of vectors in $\mathcal{K}_\lambda$ but not in $G_\lambda$. See text for a detailed
discussion.



4 Numerical experiments 

Experiments description. In this section, we support the validity of our main theoretical findings
with some numerical simulations, by checking the unbiasedness and the reliability of the SURE
for the Lasso. Here is the outline of these experiments.

For our first study, we consider two kinds of design matrices $A$: a random Gaussian matrix
with $n = 256$ and $p = 1024$ whose entries are i.i.d. $\mathcal{N}(0, 1/n)$, and a deterministic convolution
design matrix $A$ with $n = p = 256$ and a Gaussian blurring function. The original sparse vector
$x^0$ was drawn randomly according to a mixed Gaussian-Bernoulli distribution, such that $x^0$ is
15-sparse (i.e. $|\mathrm{supp}(x^0)| = 15$). For each design matrix $A$ and vector $x^0$, we generate $K = 100$
independent replications $y^k \in \mathbb{R}^n$ of the observation vector according to the linear regression model
(1). Then, for each $y^k$ and a given $\lambda$, we compute the Lasso response $\hat\mu_\lambda(y^k)$ using the now popular
iterative soft-thresholding algorithm [4]$^2$ and we compute $\mathrm{SURE}(\hat\mu_\lambda(y^k))$ and $\mathrm{SE}(\hat\mu_\lambda(y^k))$. We then
compute the empirical mean and the standard deviation of $(\mathrm{SURE}(\hat\mu_\lambda(y^k)))_{1 \leq k \leq K}$, the empirical
mean of $(\mathrm{SE}(\hat\mu_\lambda(y^k)))_{1 \leq k \leq K}$, which corresponds to the computed prediction risk, and we compute
$\widehat{R}_T$, the empirical normalized reliability corresponding to the left-hand side of (13),

$$\widehat{R}_T = \frac{1}{K} \sum_{k=1}^{K} \left(\frac{\mathrm{SURE}(\hat\mu_\lambda(y^k)) - \mathrm{SE}(\hat\mu_\lambda(y^k))}{n\sigma^2}\right)^2. \tag{17}$$



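Moreover, based on the right-hand side of (13), we compute $R_T$ as

$$R_T = -\frac{2}{n} + \frac{4}{n^2\sigma^2} \left(\frac{1}{K} \sum_{k=1}^{K} \|\hat\mu_\lambda(y^k) - y^k\|_2^2\right) + \frac{4}{n^2} \left(\frac{1}{K} \sum_{k=1}^{K} |I^*|_k\right), \tag{18}$$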
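
The two normalized reliability quantities can be computed from the per-replication values in a few lines. The sketch below is a hedged illustration: the function names and the convention of passing arrays of length $K$ are our own, and (18) is implemented as (13) divided by $n^2\sigma^4$ with expectations replaced by empirical means.

```python
import numpy as np

def empirical_reliability(sure_vals, se_vals, n, sigma):
    """Empirical counterpart of the left-hand side of (13), normalized as in (17)."""
    diff = (np.asarray(sure_vals) - np.asarray(se_vals)) / (n * sigma**2)
    return float(np.mean(diff**2))

def rhs_reliability(residual_sq, support_sizes, n, sigma):
    """Right-hand side of (13) divided by n^2 sigma^4, as in (18).

    residual_sq   : values ||mu_hat(y^k) - y^k||_2^2 over replications
    support_sizes : values |I*|_k over replications
    """
    return float(-2.0 / n
                 + 4.0 / (n**2 * sigma**2) * np.mean(residual_sq)
                 + 4.0 / n**2 * np.mean(support_sizes))

# Tiny made-up inputs, only to show the calling convention (not real data):
r_emp = empirical_reliability([3.0, 5.0], [1.0, 1.0], n=2, sigma=1.0)
r_rhs = rhs_reliability([2.0, 4.0], [1, 3], n=2, sigma=1.0)
```

With enough replications, the two numbers are expected to be close, which is what the third plot of each panel is meant to display.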



where at the $k$th replication, $|I^*|_k$ is the cardinality of the support of a Lasso solution whose active
matrix is full column rank as stated in Theorem 1. Finally, we repeat all these computations for
various values of $\lambda$, for the two kinds of design matrices considered above.

2 Iterative soft-thresholding through block-coordinate relaxation was proposed in [21] for matrices $A$ structured
as the union of a finite number of orthonormal matrices.

Figure 2: The SURE and its reliability as a function of $\lambda$ for two types of design matrices. (a)
Gaussian; (b) Convolution. For each kind of design matrix, we associate three plots (horizontal axis: $\lambda/\sigma$).
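
For readers who want to reproduce the Lasso responses, the iterative soft-thresholding algorithm mentioned in the experiments description can be sketched as follows. This is a minimal fixed-step version under our own assumptions (step size $1/\|A\|_2^2$, a fixed number of iterations, zero initialization, $A \neq 0$); it is not the authors' code, and the names `soft_threshold` and `ista` are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=5000):
    """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2       # squared spectral norm = Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)        # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Usage on a small design with columns e1 and e1 + e2:
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
x_hat = ista(A, np.array([2.0, 0.0]), lam=1.0)
mu_hat = A @ x_hat                      # Lasso response
```

In practice one would add a stopping criterion and threshold tiny entries before reading off the support, since the dof estimate depends on which coefficients are exactly nonzero.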



Construction of full rank active matrix. As stated in the discussion just after Theorem 1,
in situations where the Lasso problem has non-unique solutions, and the minimization algorithm
returns a solution whose active matrix is rank deficient, one can construct an alternative optimal
solution whose active matrix is full column rank, and then get the estimator of the degrees of
freedom.

More precisely, let $\hat x_\lambda(y)$ be a solution of the Lasso problem with support $I$ such that its active
matrix $A_I$ has a non-trivial kernel. The construction is as follows:

1. Take $h \in \ker A$, $h \neq 0$, such that $\mathrm{supp}(h) \subseteq I$.

2. For $t \in \mathbb{R}$, $A\hat x_\lambda(y) = A(\hat x_\lambda(y) + th)$ and the mapping $t \mapsto \|\hat x_\lambda(y) + th\|_1$ is locally affine
in a neighborhood of 0, i.e. for $|t| < \min_{j \in I} |(\hat x_\lambda(y))_j| / \|h\|_\infty$. $\hat x_\lambda(y)$ being a minimizer of
($P_1(y,\lambda)$), this mapping is constant in a neighborhood of 0. We have then constructed a
whole collection of solutions to ($P_1(y,\lambda)$) having the same image and the same $\ell_1$ norm,
which lives on a segment.

3. Move along $h$ with the largest step $t_0 > 0$ until an entry of $\hat x^1_\lambda(y) = \hat x_\lambda(y) + t_0 h$ vanishes, i.e.
$\mathrm{supp}(\hat x_\lambda(y) + t_0 h) \subsetneq I$.

4. Repeat this process until getting a vector $\hat x^*_\lambda(y)$ with a full column rank active matrix $A_{I^*}$.

Note that this construction bears similarities with the one in [20].
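
The four steps above can be sketched in NumPy as follows. This is an illustrative implementation under our own assumptions: the kernel direction $h$ is taken from an SVD of the active matrix, the step is signed (moving along $h$ or $-h$, whichever first zeroes an entry), and `tol` is a hypothetical numerical threshold. The input `x` is assumed to be an exact Lasso solution, which is what keeps the $\ell_1$ norm constant along the move.

```python
import numpy as np

def full_rank_solution(A, x, tol=1e-10):
    """Shrink the support of a Lasso solution x until A_I has full column rank."""
    x = np.array(x, dtype=float)
    x[np.abs(x) <= tol] = 0.0
    while True:
        I = np.flatnonzero(x)                    # current support
        if I.size == 0:
            return x
        # Trailing rows of Vt span ker(A_I) when A_I is rank deficient.
        _, s, Vt = np.linalg.svd(A[:, I], full_matrices=True)
        rank = int(np.sum(s > tol * max(1.0, s.max())))
        if rank == I.size:                       # step 4: stop when full rank
            return x
        h = Vt[-1]                               # step 1: kernel direction on I
        moving = np.flatnonzero(np.abs(h) > tol)
        steps = -x[I][moving] / h[moving]        # steps that zero each entry
        k = np.argmin(np.abs(steps))             # step 3: first entry to vanish
        x[I] = x[I] + steps[k] * h
        x[I[moving[k]]] = 0.0                    # remove round-off on that entry
        x[np.abs(x) <= tol] = 0.0

# Example: duplicated column, y = (2, 0), lambda = 1; x = (0.3, 0.7) is a solution.
A = np.array([[1.0, 1.0],
              [0.0, 0.0]])
x = np.array([0.3, 0.7])
x_star = full_rank_solution(A, x)
dof_hat = np.count_nonzero(x_star)               # estimate |I*|
```

Each pass removes at least one index from the support, so the loop terminates after at most $|I|$ iterations.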

Figure 3: The SURE and its reliability as a function of the number of observations $n$ (panel (c): $\lambda/\sigma = 10$).



Results discussion. Figure 2 depicts the obtained results. For each design matrix, we associate a
panel, each containing three plots. For each case, from left to right, the first plot represents
the SURE for one realization of the noise as a function of $\lambda$. In the second graph, we plot the computed
prediction risk curve and the empirical mean of the SURE as a function of the regularization
parameter $\lambda$. Namely, the dashed curve represents the calculated prediction risk, the solid curve
represents the empirical mean of the SURE, and the shaded area represents the empirical mean
of the SURE ± the empirical standard deviation of the SURE. The latter shows that the SURE
is an unbiased estimator of the prediction risk with a controlled variance. This suggests that the
SURE is consistent, and then so is our estimator of the degrees of freedom. In the third graph,
we plot the two versions of the normalized reliability, defined respectively by (17) and (18),
as a function of the regularization parameter $\lambda$. More precisely, the solid and dashed blue curves
represent these two quantities. This confirms numerically that both sides of (13)
indeed coincide.

As discussed in the introduction, one of the motivations of having an unbiased estimator of the
degrees of freedom of the Lasso is to provide a data-driven objective way for selecting the optimal
Lasso regularization parameter $\lambda$. For this, one can compute the optimal $\lambda$ that minimizes the
SURE, i.e.

$$\lambda_{\mathrm{optimal}} = \underset{\lambda > 0}{\mathrm{argmin}}\ \mathrm{SURE}(\hat\mu_\lambda(y)). \tag{19}$$

In practice, this optimal value can be found either by an exhaustive search over a fine grid, or alternatively
by any dichotomic search algorithm (e.g. golden section) if $\lambda \mapsto \mathrm{SURE}(\hat\mu_\lambda(y))$ is unimodal.
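
As an illustration of the dichotomic option, here is a generic golden-section search for a scalar function. It is a sketch under the unimodality assumption stated above; the function `golden_section_min`, the bracket `[lo, hi]`, and the tolerance are our own choices. In an actual experiment, `f` would map $\lambda$ to the SURE value obtained after solving the Lasso at that $\lambda$; a simple quadratic stands in for it here.

```python
import math

def golden_section_min(f, lo, hi, tol=1e-6):
    """Locate the minimizer of a unimodal function f on [lo, hi]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0      # 1 / golden ratio
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                            # minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                  # minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Stand-in objective (not a real SURE curve), minimized at 1.3:
lam_opt = golden_section_min(lambda lam: (lam - 1.3) ** 2, 0.0, 5.0)
```

Each iteration costs one new function evaluation, that is, one Lasso solve per step, which is the main appeal over a fine grid. If the SURE curve is not unimodal, the search may return a local minimizer only.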

Now, for our second simulation study, we consider a partial Fourier design matrix, with $n < p$
and a constant underdeterminacy factor $p/n = 4$. $x^0$ was again simulated according to a mixed
Gaussian-Bernoulli distribution with $[0.1p]$ non-zero entries. For each of three values of $\lambda/\sigma \in
\{0.1, 1, 10\}$ (small, medium and large), we compute the prediction risk curve, the empirical mean
of the SURE, as well as the values of the two normalized reliability quantities (17) and (18), as a function of
$n \in \{8, \cdots, 1024\}$. The obtained results are shown in Figure 3. For each value of $\lambda$, the first plot
(top panel) displays the normalized empirical mean of the SURE (solid line) and its 5% quantiles
(dotted) as well as the computed normalized prediction risk (dashed). Unbiasedness is again clear
whatever the value of $\lambda$. The trend on the prediction risk (and average SURE) is in agreement
with rates known for the Lasso, see e.g. [2]. The second plot confirms that the SURE is an
asymptotically reliable estimate of the prediction risk with the rate established in Theorem 2.
Moreover, as expected, the actual reliability gets closer to the upper bound (48) as the number of
samples $n$ increases.

5 Proofs 

First of all, we recall some classical properties of any solution of the Lasso (see, e.g., [17], [7], [11], [27]).
To lighten the notation in the two following lemmas, we will drop the dependency of the minimizers
of ($P_1(y,\lambda)$) on either $\lambda$ or $y$.



Lemma 1. $x$ is a (global) minimizer of the Lasso problem ($P_1(y,\lambda)$) if and only if:

1. $A_I^T(y - Ax) = \lambda\, \mathrm{sign}(x_I)$, where $I = \{i : x_i \neq 0\}$, and

2. $|\langle a_j, y - Ax \rangle| \leq \lambda$, $\forall j \in I^c$,

where $I^c = \{1, \ldots, p\} \setminus I$. Moreover, if $A_I$ is full column rank, then $x$ satisfies the following implicit
relationship:

$$x_I = A_I^+ y - \lambda (A_I^T A_I)^{-1} \mathrm{sign}(x_I). \tag{20}$$

Note that if, in addition to $A_I$ being full column rank, the inequality in condition 2 above is strict,
then $x$ is the unique minimizer of the Lasso problem ($P_1(y,\lambda)$).

Lemma 2 below shows that all solutions of ($P_1(y,\lambda)$) have the same image by $A$. In other
words, the Lasso response $\hat\mu_\lambda(y)$ is unique, see [5].

Lemma 2. If $x^1$ and $x^2$ are two solutions of ($P_1(y,\lambda)$), then

$$Ax^1 = Ax^2 = \hat\mu_\lambda(y).$$

Before delving into the technical details, we recall the following trace formula of the divergence.
Let $J_{\hat\mu}(y)$ be the Jacobian matrix of a mapping $y \mapsto \hat\mu(y)$, defined as follows

$$(J_{\hat\mu}(y))_{i,j} := \frac{\partial \hat\mu_i(y)}{\partial y_j}, \qquad i, j = 1, \cdots, n. \tag{21}$$

Then we can write

$$\mathrm{div}(\hat\mu(y)) = \mathrm{tr}(J_{\hat\mu}(y)). \tag{22}$$

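Proof of Theorem 1. Let $\hat x^*_\lambda(y)$ be a solution of the Lasso problem ($P_1(y,\lambda)$) and $I^*$ its support
such that $A_{I^*}$ is full column rank. Let $(\hat x^*_\lambda(y))_{I^*}$ be the restriction of $\hat x^*_\lambda(y)$ to its support and
$S^* = \mathrm{sign}((\hat x^*_\lambda(y))_{I^*})$. From Lemma 2 we have

$$\hat\mu_\lambda(y) = A\hat x^*_\lambda(y) = A_{I^*}(\hat x^*_\lambda(y))_{I^*}.$$

According to Lemma 1 we know that

$$A_{I^*}^T(y - \hat\mu_\lambda(y)) = \lambda S^*;$$
$$|\langle a_k, y - \hat\mu_\lambda(y) \rangle| \leq \lambda, \quad \forall k \in (I^*)^c.$$

Furthermore, from (20), we get the following implicit form of $\hat x^*_\lambda(y)$

$$(\hat x^*_\lambda(y))_{I^*} = A_{I^*}^+ y - \lambda (A_{I^*}^T A_{I^*})^{-1} S^*. \tag{23}$$

It follows that

$$\hat\mu_\lambda(y) = P_{V_{I^*}}(y) - \lambda d_{I^*,S^*}, \tag{24}$$

and

$$\hat r_\lambda(y) = y - \hat\mu_\lambda(y) = P_{V_{I^*}^\perp}(y) + \lambda d_{I^*,S^*}, \tag{25}$$

where $d_{I^*,S^*} = (A_{I^*}^+)^T S^*$. We define the following set of indices

$$J = \{j : |\langle a_j, \hat r_\lambda(y) \rangle| = \lambda\}. \tag{26}$$

From Lemma 1 we deduce that

$$I^* \subseteq J.$$

Since the orthogonal projection is a self-adjoint operator and from (25), for all $j \in J$, we have

$$|\langle P_{V_{I^*}^\perp}(a_j), y \rangle + \lambda \langle a_j, d_{I^*,S^*} \rangle| = \lambda. \tag{27}$$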
As $y \in G_\lambda$, we deduce that if $j \in J \cap (I^*)^c$ then inevitably we have

$$a_j \in V_{I^*}, \quad \text{and therefore} \quad |\langle a_j, d_{I^*,S^*} \rangle| = 1. \tag{28}$$

In fact, if $a_j \notin V_{I^*}$ then $(I^*, j, S^*) \in \Omega$ and from (27) we have that $y \in H_{I^*,j,S^*}$, which is a
contradiction with $y \in G_\lambda$.

Therefore, the collection of vectors $(a_i)_{i \in I^*}$ forms a basis of $V_J = \mathrm{span}(a_j)_{j \in J}$. Now, suppose that
$\hat x_\lambda(y)$ is another solution of ($P_1(y,\lambda)$), such that its support $I$ is different from $I^*$. If $A_I$ is full
column rank, then by using the same arguments as above we can deduce that $(a_i)_{i \in I}$ also forms a
basis of $V_J$. Therefore, we have

$$|I| = |I^*| = \dim(V_J).$$

On the other hand, if $A_I$ is not full rank, then there exists a subset $I_0 \subsetneq I$ such that $A_{I_0}$ is full
rank (see the discussion following Theorem 1) and $(a_i)_{i \in I_0}$ also forms a basis of $V_J$, which implies
that

$$|I| > |I_0| = \dim(V_J) = |I^*|.$$

We conclude that for any solution $\hat x_\lambda(y)$ of ($P_1(y,\lambda)$), we have

$$|\mathrm{supp}(\hat x_\lambda(y))| \geq |I^*|,$$

and then $|I^*|$ is equal to the minimum of the cardinalities of the supports of solutions of ($P_1(y,\lambda)$).
This proves the first part of the theorem.

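Let us turn to the second statement. Note that $G_\lambda$ is an open set and all components of $(\hat x^*_\lambda(y))_{I^*}$
are nonzero, so we can choose a small enough $\varepsilon$ such that $\mathrm{Ball}(y, \varepsilon) \subseteq G_\lambda$, that is, for all $z \in
\mathrm{Ball}(y, \varepsilon)$, $z \in G_\lambda$. Now, let $\hat x^1_\lambda(z)$ be the vector supported in $I^*$ and defined by

$$(\hat x^1_\lambda(z))_{I^*} = A_{I^*}^+ z - \lambda (A_{I^*}^T A_{I^*})^{-1} S^* = (\hat x^*_\lambda(y))_{I^*} + A_{I^*}^+(z - y). \tag{29}$$

If $\varepsilon$ is small enough, then for all $z \in \mathrm{Ball}(y, \varepsilon)$, we have

$$\mathrm{sign}(\hat x^1_\lambda(z))_{I^*} = \mathrm{sign}(\hat x^*_\lambda(y))_{I^*} = S^*. \tag{30}$$

In the rest of the proof, we invoke Lemma 1 to show that, for $\varepsilon$ small enough, $\hat x^1_\lambda(z)$ is actually
a solution of ($P_1(z,\lambda)$). First we notice that $z - A\hat x^1_\lambda(z) = P_{V_{I^*}^\perp}(z) + \lambda d_{I^*,S^*}$. It follows that

$$A_{I^*}^T(z - A\hat x^1_\lambda(z)) = \lambda A_{I^*}^T d_{I^*,S^*} = \lambda S^* = \lambda\, \mathrm{sign}(\hat x^1_\lambda(z))_{I^*}. \tag{31}$$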


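Moreover for all $j \in J \cap (I^*)^c$, from (28), we have that

$$|\langle a_j, z - A\hat x^1_\lambda(z) \rangle| = |\langle a_j, P_{V_{I^*}^\perp}(z) + \lambda d_{I^*,S^*} \rangle|
= |\langle P_{V_{I^*}^\perp}(a_j), z \rangle + \lambda \langle a_j, d_{I^*,S^*} \rangle|
= \lambda |\langle a_j, d_{I^*,S^*} \rangle| = \lambda,$$

and for all $j \notin J$

$$|\langle a_j, z - A\hat x^1_\lambda(z) \rangle| \leq |\langle a_j, y - A\hat x^*_\lambda(y) \rangle| + |\langle P_{V_{I^*}^\perp}(a_j), z - y \rangle|.$$

Since for all $j \notin J$, $|\langle a_j, y - A\hat x^*_\lambda(y) \rangle| < \lambda$, there exists $\varepsilon$ such that for all $z \in \mathrm{Ball}(y, \varepsilon)$ and $\forall j \notin J$,
we have

$$|\langle a_j, z - A\hat x^1_\lambda(z) \rangle| < \lambda.$$

Therefore, we obtain

$$|\langle a_j, z - A\hat x^1_\lambda(z) \rangle| \leq \lambda, \quad \forall j \in (I^*)^c,$$

which, by Lemma 1, means that $\hat x^1_\lambda(z)$ is a solution of ($P_1(z,\lambda)$), and the unique Lasso response
associated to ($P_1(z,\lambda)$), denoted by $\hat\mu_\lambda(z)$, is defined by

$$\hat\mu_\lambda(z) = P_{V_{I^*}}(z) - \lambda d_{I^*,S^*}. \tag{32}$$

Therefore, from (24) and (32), we can deduce that for all $z \in \mathrm{Ball}(y, \varepsilon)$ we have

$$\hat\mu_\lambda(z) = \hat\mu_\lambda(y) + P_{V_{I^*}}(z - y).$$

□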

□ 



Proof of Corollary 1. We showed that there exists $\varepsilon$ sufficiently small such that

$$\|z - y\|_2 < \varepsilon \implies \hat\mu_\lambda(z) = \hat\mu_\lambda(y) + P_{V_{I^*}}(z - y). \tag{33}$$

Let $h \in V_{I^*}$ such that $\|h\|_2 < \varepsilon$ and $z = y + h$. Thus, we have that $\|z - y\|_2 < \varepsilon$ and then

$$\|\hat\mu_\lambda(z) - \hat\mu_\lambda(y)\|_2 = \|P_{V_{I^*}}(h)\|_2 = \|h\|_2 < \varepsilon. \tag{34}$$

Therefore, the Lasso response $\hat\mu_\lambda(y)$ is uniformly Lipschitz on $G_\lambda$. Moreover, $\hat\mu_\lambda(y)$ is a continuous
function of $y$, and thus $\hat\mu_\lambda(y)$ is uniformly Lipschitz on $\mathbb{R}^n$. Hence, $\hat\mu_\lambda(y)$ is almost differentiable;
see [15] and [7].

On the other hand, we proved that there exists a neighborhood of $y$, such that for all $z$ in this
neighborhood, there exists a solution of the Lasso problem ($P_1(z,\lambda)$), which has the same support
and the same sign as $\hat x^*_\lambda(y)$, and thus $\hat\mu_\lambda(z)$ belongs to the vector space $V_{I^*}$, whose dimension
equals $|I^*|$, see (24) and (32). Therefore, $\hat\mu_\lambda(y)$ is a locally affine function of $y$, and then

$$J_{\hat\mu_\lambda}(y) = P_{V_{I^*}}. \tag{35}$$

Then the trace formula (22) implies that

$$\mathrm{div}(\hat\mu_\lambda(y)) = \mathrm{tr}(P_{V_{I^*}}) = |I^*|. \tag{36}$$

This holds almost everywhere since $G_\lambda$ is of full measure, and (10) is obtained by invoking Stein's
lemma. □
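
The key identity behind (36), namely that the trace of an orthogonal projector equals the dimension of its range, is easy to confirm numerically. The sketch below uses a random $6 \times 3$ matrix as a stand-in for a full column rank active matrix; the sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A_I = rng.standard_normal((6, 3))     # stand-in for a full column rank A_{I*}

# Orthogonal projector onto span(A_I): P = A_I A_I^+.
P = A_I @ np.linalg.pinv(A_I)

trace_P = np.trace(P)                 # should equal the number of columns, 3
is_symmetric = np.allclose(P, P.T)    # self-adjoint
is_idempotent = np.allclose(P @ P, P) # P^2 = P
```

Since $P$ is also idempotent, $\mathrm{tr}(P^2) = \mathrm{tr}(P)$, which is the property used later in the proof of Theorem 2.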

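Proof of Theorem 2. First, consider the following random variable

$$Q_1(\hat\mu_\lambda(y)) = \|\hat\mu_\lambda(y)\|_2^2 + \|\mu\|_2^2 - 2\langle y, \hat\mu_\lambda(y) \rangle + 2\sigma^2\, \mathrm{div}(\hat\mu_\lambda(y)).$$

From Stein's lemma, we have

$$\mathbb{E}(\langle \varepsilon, \hat\mu_\lambda(y) \rangle) = \sigma^2\, \mathbb{E}(\mathrm{div}(\hat\mu_\lambda(y))).$$

Thus, we can deduce that $Q_1(\hat\mu_\lambda(y))$ and $\mathrm{SURE}(\hat\mu_\lambda(y))$ are unbiased estimators of the prediction
risk, i.e.

$$\mathbb{E}(\mathrm{SURE}(\hat\mu_\lambda(y))) = \mathbb{E}(Q_1(\hat\mu_\lambda(y))) = \mathbb{E}(\mathrm{SE}(\hat\mu_\lambda(y))) = \mathrm{Risk}(\mu).$$

Moreover, note that $\mathrm{SURE}(\hat\mu_\lambda(y)) - Q_1(\hat\mu_\lambda(y)) = \|y\|_2^2 - \mathbb{E}(\|y\|_2^2)$, where

$$\mathbb{E}(\|y\|_2^2) = n\sigma^2 + \|\mu\|_2^2, \quad \text{and} \quad \mathbb{V}(\|y\|_2^2) = 2\sigma^4\left(n + 2\frac{\|\mu\|_2^2}{\sigma^2}\right). \tag{37}$$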
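Now, we remark also that

$$Q_1(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)) = 2\left(\sigma^2\, \mathrm{div}(\hat\mu_\lambda(y)) - \langle \varepsilon, \hat\mu_\lambda(y) \rangle\right). \tag{38}$$

After an elementary calculation, we obtain

$$\mathbb{E}(\mathrm{SURE}(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)))^2 = \mathbb{E}(Q_1(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)))^2 + \mathbb{V}(\|y\|_2^2) + 4T, \tag{39}$$

where

$$T = \sigma^2\, \mathbb{E}\left(\mathrm{div}(\hat\mu_\lambda(y)) \|y\|_2^2\right) - \mathbb{E}\left(\langle \varepsilon, \hat\mu_\lambda(y) \rangle \|y\|_2^2\right) = T_1 + T_2, \tag{40}$$

with

$$T_1 = 2\left(\sigma^2\, \mathbb{E}(\mathrm{div}(\hat\mu_\lambda(y)) \langle \varepsilon, \mu \rangle) - \mathbb{E}(\langle \varepsilon, \hat\mu_\lambda(y) \rangle \langle \varepsilon, \mu \rangle)\right) \tag{41}$$

and

$$T_2 = \sigma^2\, \mathbb{E}\left(\mathrm{div}(\hat\mu_\lambda(y)) \|\varepsilon\|_2^2\right) - \mathbb{E}\left(\langle \varepsilon, \hat\mu_\lambda(y) \rangle \|\varepsilon\|_2^2\right). \tag{42}$$

Hence, by using the fact that a Gaussian probability density $\varphi(\varepsilon_i)$ satisfies $\varepsilon_i \varphi(\varepsilon_i) = -\sigma^2 \varphi'(\varepsilon_i)$
and integrations by parts, we find that

$$T_1 = -2\sigma^2\, \mathbb{E}(\langle \hat\mu_\lambda(y), \mu \rangle)$$

and

$$T_2 = -2\sigma^4\, \mathbb{E}(\mathrm{div}(\hat\mu_\lambda(y))).$$

It follows that

$$T = -2\sigma^2\left(\mathbb{E}(\langle \hat\mu_\lambda(y), \mu \rangle) + \sigma^2\, \mathbb{E}(\mathrm{div}(\hat\mu_\lambda(y)))\right). \tag{43}$$

Moreover, from [13, Property 1], we know that

$$\mathbb{E}(Q_1(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)))^2 = 4\sigma^2\left(\mathbb{E}\left(\|\hat\mu_\lambda(y)\|_2^2\right) + \sigma^2\, \mathbb{E}\left(\mathrm{tr}\left((J_{\hat\mu_\lambda}(y))^2\right)\right)\right). \tag{44}$$

Thus, since $J_{\hat\mu_\lambda}(y) = P_{V_{I^*}}$, which is an orthogonal projector (hence self-adjoint and idempotent),
we have $\mathrm{tr}((J_{\hat\mu_\lambda}(y))^2) = \mathrm{div}(\hat\mu_\lambda(y)) = |I^*|$. Therefore, we get

$$\mathbb{E}(Q_1(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)))^2 = 4\sigma^2\left(\mathbb{E}\left(\|\hat\mu_\lambda(y)\|_2^2\right) + \sigma^2\, \mathbb{E}(|I^*|)\right). \tag{45}$$

Furthermore, observe that

$$\mathbb{E}(\mathrm{SURE}(\hat\mu_\lambda(y))) = -n\sigma^2 + \mathbb{E}\left(\|\hat\mu_\lambda(y) - y\|_2^2\right) + 2\sigma^2\, \mathbb{E}(|I^*|). \tag{46}$$

Therefore, by combining (37), (39), (43) and (45), we obtain

$$\mathbb{E}(\mathrm{SURE}(\hat\mu_\lambda(y)) - \mathrm{SE}(\hat\mu_\lambda(y)))^2 = 2n\sigma^4 + 4\sigma^2\, \mathbb{E}(\mathrm{SE}(\hat\mu_\lambda(y))) - 4\sigma^4\, \mathbb{E}(|I^*|)$$
$$= 2n\sigma^4 + 4\sigma^2\, \mathbb{E}(\mathrm{SURE}(\hat\mu_\lambda(y))) - 4\sigma^4\, \mathbb{E}(|I^*|)$$
$$= -2n\sigma^4 + 4\sigma^2\, \mathbb{E}\left(\|\hat\mu_\lambda(y) - y\|_2^2\right) + 4\sigma^4\, \mathbb{E}(|I^*|) \quad \text{(by using (46))}.$$

On the other hand, since $\hat x^*_\lambda(y)$ is a minimizer of the Lasso problem ($P_1(y,\lambda)$), we observe that

$$\frac{1}{2}\|\hat\mu_\lambda(y) - y\|_2^2 \leq \frac{1}{2}\|\hat\mu_\lambda(y) - y\|_2^2 + \lambda\|\hat x^*_\lambda(y)\|_1 \leq \frac{1}{2}\|A \cdot 0 - y\|_2^2 + \lambda\|0\|_1 = \frac{1}{2}\|y\|_2^2.$$

Therefore, we have

$$\mathbb{E}\left(\|\hat\mu_\lambda(y) - y\|_2^2\right) \leq \mathbb{E}\left(\|y\|_2^2\right) = n\sigma^2 + \|\mu\|_2^2. \tag{47}$$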
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Then, since |/*| = 0(n) and from (47), we have 



/SURE(£ A (y))-SEQi A (y))\ 2 \ ^ 6 , 4||/x 




E ^ <Z+^Bf (48) 



Finally, since \\fi\\2 < +00, we can deduce that 



E // SUREffi A (y))-SE(/2 A ( ?/ )) \ 2 \ =0 (l 



□ 
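The bound (48) can be illustrated by a small Monte Carlo experiment. The sketch below is not the paper's simulation setup: it assumes the special case of an identity design matrix, for which the Lasso solution is soft thresholding and the support size is the number of entries of $y$ exceeding $\lambda$ in magnitude. The function names, the sparse mean vector and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(y, lam):
    """Lasso solution for an identity design (assumed special case):
    argmin_x 0.5 * ||y - x||_2^2 + lam * ||x||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def sure_vs_se(mu, sigma, lam, reps=2000, seed=0):
    """Monte Carlo estimate of E[((SURE - SE) / (n sigma^2))^2],
    returned together with the right-hand side of the bound (48)."""
    rng = np.random.default_rng(seed)
    n = mu.size
    sure = np.empty(reps)
    se = np.empty(reps)
    for r in range(reps):
        y = mu + sigma * rng.standard_normal(n)
        mu_hat = soft_threshold(y, lam)
        support_size = np.count_nonzero(mu_hat)   # dof estimate |I*|
        sure[r] = (-n * sigma**2 + np.sum((mu_hat - y) ** 2)
                   + 2 * sigma**2 * support_size)
        se[r] = np.sum((mu_hat - mu) ** 2)
    gap = np.mean(((sure - se) / (n * sigma**2)) ** 2)
    bound = 6.0 / n + 4.0 * np.sum(mu**2) / (n**2 * sigma**2)
    return gap, bound, sure.mean(), se.mean()

mu = np.zeros(200)
mu[:10] = 3.0                     # hypothetical sparse mean vector
gap, bound, mean_sure, mean_se = sure_vs_se(mu, sigma=1.0, lam=1.0)
print("normalized squared gap:", gap, " bound (48):", bound)
print("mean SURE:", mean_sure, " mean SE:", mean_se)
```

In this setting the empirical normalized gap should fall below the bound, and the averages of SURE and SE should be close, in line with the unbiasedness of SURE.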



6 Discussion 

In this paper we proved that the number of nonzero coefficients of a particular solution of the
Lasso problem is an unbiased estimate of the degrees of freedom of the Lasso response for linear
regression models. This result covers both the overdetermined and the underdetermined cases. It was
obtained through a divergence formula that is valid everywhere except on a set of Lebesgue measure
zero. We gave a precise characterization of this set, which turns out to be larger than the set of
all the vectors associated to the transition points considered in [32] in the overdetermined case.
We also highlighted the fact that, even in the overdetermined case, excluding only the set of
transition points is not sufficient for the divergence formula to hold.

We think that some techniques developed in this article can be applied to derive the degrees of 
freedom of other nonlinear estimating procedures. Typically, a natural extension of this work is to 
consider other penalties such as those promoting structured sparsity, e.g. the group Lasso. 

Acknowledgement This work was partly funded by the ANR grant NatImages, ANR-08-EMER-009.



References 

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. 
Second International Symposium on Information Theory 267-281. 

[2] Bickel, P. J., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig
selector. Annals of Statistics 37 1705-1732.

[3] Craven, P. and Wahba, G. (1979). Smoothing Noisy Data with Spline Functions: estimating 
the correct degree of smoothing by the method of generalized cross validation. Numerische 
Mathematik 31, 377-403. 

[4] Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for
linear inverse problems with a sparsity constraint. Communications on Pure and Applied
Mathematics 57, 1413-1457.

[5] Dossal, C. (2007). A necessary and sufficient condition for exact recovery by $\ell_1$ minimization.
Technical report, HAL-00164738:1.

[6] Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation 
(with discussion). J. Amer. Statist. Assoc. 99 619-642. 






[7] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with 
discussion). Ann. Statist. 32 407-499. 

[8] Efron, B. (1981). How biased is the apparent error rate of a prediction rule. J. Amer. Statist. 
Assoc. vol. 81 pp. 461-470. 

[9] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle 
properties. J. Amer. Statist. Assoc. 96 1348-1360. 

[10] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of 
parameters. The Annals of Statistics, 32(3), 928-961. 

[11] Fuchs, J. J. (2004). On sparse representations in arbitrary redundant bases. IEEE Trans. 
Inform. Theory, vol. 50, no. 6, pp. 1341-1344. 

[12] Kato, K. (2009). On the degrees of freedom in shrinkage estimation. Journal of Multivariate 
Analysis 100(7), 1338-1352. 

[13] Luisier, F. (2009). The SURE-LET approach to image denoising. Ph.D. dissertation, EPFL,
Lausanne. Available: http://library.epfl.ch/theses/?nr=4566

[14] Mallows, C. (1973). Some comments on $C_p$. Technometrics 15, 661-675.

[15] Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape restricted regression.
Ann. Statist. 28 1083-1104.

[16] Nardi, Y. and Rinaldo, A. (2008). On the asymptotic properties of the group Lasso estimator
for linear models. Electronic Journal of Statistics 2 605-633.

[17] Osborne, M., Presnell, B. and Turlach, B. (2000a). A new approach to variable selection in 
least squares problems. IMA J. Numer. Anal. 20 389-403. 

[18] Osborne, M. R., Presnell, B. and Turlach, B. (2000b). On the LASSO and its dual. J. Comput. 
Graph. Statist. 9 319-337. 

[19] Ravikumar, P., Liu, H., Lafferty, J., and Wasserman, L. (2008). SpAM: Sparse additive models.
In Advances in Neural Information Processing Systems (NIPS), volume 22.

[20] Rosset, S., Zhu, J., Hastie, T. (2004). Boosting as a Regularized Path to a Maximum Margin 
Classifier. J. Mach. Learn. Res. 5 941-973. 

[21] Sardy, S., Bruce, A., and Tseng, P. (2000). Block coordinate relaxation methods for
nonparametric wavelet denoising. J. of Comp. Graph. Stat. 9(2) 361-379.

[22] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. 

[23] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 
9 1135-1151. 

[24] Tibshirani, R. and Taylor, J. (2011). The Solution Path of the Generalized Lasso. Annals of 
Statistics. In Press. 

[25] Tibshirani, R. and Taylor, J. (2012). Degrees of Freedom in Lasso Problems. Technical report,
arXiv:1111.0653.






[26] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. 
Ser. B 58(1) 267-288. 

[27] Tropp J. A. (2006). Just relax: convex programming methods for identifying sparse signals in 
noise, IEEE Trans. Info. Theory 52 (3), 1030-1051. 

[28] Vaiter, S., Peyré, G., Dossal, C. and Fadili, M. J. (2011). Robust sparse analysis regularization.
arXiv:1109.6222.

[29] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped 
variables. J. Roy. Statist. Soc. Ser. B 68 49-67. 

[30] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine
Learning Research 7, 2541-2563.

[31] Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American
Statistical Association 101(476), 1418-1429.

[32] Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the Lasso. Ann.
Statist. 35(5) 2173-2192.






